Abstract

Recently, new approaches to adaptive control 
have sought to reformulate the problem as a 
minimization of a relative entropy criterion 
to obtain tractable solutions. In particular, it 
has been shown that minimizing the expected 
deviation from the causal input-output de- 
pendencies of the true plant leads to a new 
promising stochastic control rule called the 
Bayesian control rule. This work proves the 
convergence of the Bayesian control rule un- 
der two sufficient assumptions: boundedness, 
which is an ergodicity condition; and consis- 
tency, which is an instantiation of the sure- 
thing principle. 
1. Introduction

When the behavior of a plant under any control signal 
is fully known, then the designer can choose a con- 
troller that produces the desired dynamics. Instances 
of this problem include hitting a target with a can- 
non under known weather conditions, solving a maze 
having its map and controlling a robotic arm in a man- 
ufacturing plant. However, when the behavior of the 
plant is unknown, then the designer faces the problem 
of adaptive control. Examples include shooting the cannon
without the appropriate measurement equipment, finding
the way out of an unknown maze and designing an
autonomous robot for Martian exploration. Adaptive
control turns out to be far more difficult than its non- 
adaptive counterpart. Even when the plant dynam- 
ics is known to belong to a particular class for which 
optimal controllers are available, constructing the cor- 
responding optimal adaptive controller is in general 
intractable even for simple toy problems (Duff, 2002).
Thus, virtually all of the effort of the research commu- 
nity is centered around the development of tractable 
approximations. 

Recently, new formulations of the adaptive control 
problem that are based on the minimization of a rela- 
tive entropy criterion have attracted the interest of the 
control and reinforcement learning community. For 
example, it has been shown that a large class of op- 
timal control problems can be solved very efficiently 
if the problem statement is reformulated as the min- 
imization of the deviation of the dynamics of a con- 
trolled system from the uncontrolled system (Todorov, 
2006; 2009; Kappen et al., 2009). A similar approach
minimizes the deviation of the causal input/output
relationship of a Bayesian mixture of controllers from
the true controller, obtaining an explicit solution called
the Bayesian control rule (Ortega & Braun, 2010).
This control rule is particularly interesting because it
leads to stochastic controllers that infer the optimal
controller on-line by combining the plant-specific
controllers, implicitly using the uncertainty of the
dynamics to trade off exploration against exploitation.

Although the Bayesian control rule constitutes a 
promising approach to adaptive control, there are cur- 
rently no proofs that guarantee its convergence to the 
desired policy. The aim of this paper is to develop a
set of sufficient conditions for convergence and then to
provide a proof. The analysis is limited to the simple
case of controllers having a finite number of modes of
operation. Special care has been taken to illustrate
the motivation behind the concepts.

2. Preliminaries 

The exposition is restricted to the case of discrete time
with discrete stochastic observations and control signals.
Let $\mathcal{O}$ and $\mathcal{A}$ be two finite sets of symbols, where
the former is the set of inputs (observations) and the
latter the set of outputs (actions). Actions and observations
at time $t$ are denoted as $a_t \in \mathcal{A}$ and $o_t \in \mathcal{O}$
respectively, and the shorthand $a_{\le t} := a_1, a_2, \ldots, a_t$ and
the like are used to simplify the notation of strings.
Symbols are underlined to glue them together as in
$\underline{ao}_{\le t} = a_1, o_1, a_2, o_2, \ldots, a_t, o_t$. It is assumed that the
interaction between the controller and the plant proceeds
in cycles $t = 1, 2, \ldots$ where in cycle $t$ the controller
issues action $a_t$ and the plant responds with an
observation $o_t$.

A controller is defined as a probability distribution 
P over the input/output (I/O) stream, and it is fully 
characterized by the conditional probabilities 

$$P(a_t \mid \underline{ao}_{<t}) \quad \text{and} \quad P(o_t \mid \underline{ao}_{<t} a_t)$$

representing the probabilities of emitting action $a_t$ and
collecting observation $o_t$ given the respective I/O history.
Similarly, a plant is defined as a probability distribution
$Q$ characterized by the conditional probabilities

$$Q(o_t \mid \underline{ao}_{<t} a_t)$$

representing the probabilities of emitting observation
$o_t$ given the I/O history.

If the plant is known, i.e. if the conditional probabilities
$Q(o_t \mid \underline{ao}_{<t} a_t)$ are known, then the designer can
build a suitable controller by equating the observation
streams as $P(o_t \mid \underline{ao}_{<t} a_t) = Q(o_t \mid \underline{ao}_{<t} a_t)$ and by
defining action probabilities $P(a_t \mid \underline{ao}_{<t})$ such that the
resulting distribution $P$ maximizes a desired utility criterion.
In this case $P$ is said to be tailored to $Q$. In many
situations the conditional probabilities $P(a_t \mid \underline{ao}_{<t})$ will
be deterministic, but there are cases (e.g. in repeated
games) where the designer might prefer stochastic policies
instead.

If the plant is unknown then one faces an adaptive
control problem. Assume we know that the plant
$Q_m$ is going to be drawn randomly from a set
$\mathcal{Q} := \{Q_m\}_{m \in \mathcal{M}}$ of possible plants indexed by $\mathcal{M}$.
Assume further we have available a set of controllers
$\mathcal{P} := \{P_m\}_{m \in \mathcal{M}}$, where each $P_m$ is tailored to $Q_m$.
How can we now construct a controller $P$ such that
its behavior is as close as possible to the tailored
controller $P_m$ under any realization of $Q_m \in \mathcal{Q}$?

3. Bayesian Control Rule 

A naive approach would be to minimize the relative
entropy of the controller $P$ with respect to the true
controller $P_m$, averaged over all possible values of $m$.
However, this is syntactically incorrect. The important
observation made in Ortega & Braun (2010) is
that we do not want to minimize the deviation of $P$
from $P_m$, but the deviation of the causal I/O dependencies
in $P$ from the causal I/O dependencies in $P_m$.
Intuitively speaking, one does not want to predict actions
and observations, but to predict the observations
(effect) given actions (causes). More specifically, they
propose to minimize a set of (causal) divergences $C$
defined by



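$$C := \limsup_{t \to \infty} \sum_{m} P(m) \sum_{\tau=1}^{t} C_\tau \qquad (1)$$

where

$$C_\tau := P_m(\underline{\hat{a}o}_{<\tau}) \, C_\tau(\underline{\hat{a}o}_{<\tau})$$

and where $P(m)$ is the prior probability of $m \in \mathcal{M}$, $\hat{a}_\tau$
denotes an intervened (not observed) action at time $\tau$,
and $\hat{a}_1, \hat{a}_2, \hat{a}_3, \ldots$ is an arbitrary sequence of intervened
actions that gives rise to a particular instantiation of
$C$.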

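In Ortega & Braun (2010), it is shown that the controller
$P$ that minimizes $C$ in Equation (1) for any
sequence of intervened actions is given by the conditional
probabilities

$$P(a_t \mid \underline{\hat{a}o}_{<t}) := \sum_{m} P_m(a_t \mid \underline{ao}_{<t}) \, P(m \mid \underline{\hat{a}o}_{<t})$$
$$P(o_t \mid \underline{\hat{a}o}_{<t} \hat{a}_t) := \sum_{m} P_m(o_t \mid \underline{ao}_{<t} a_t) \, P(m \mid \underline{\hat{a}o}_{<t}) \qquad (2)$$

where

$$P(m \mid \underline{\hat{a}o}_{\le t}) := \frac{P_m(o_t \mid \underline{ao}_{<t} a_t) \, P(m \mid \underline{\hat{a}o}_{<t})}{\sum_{m'} P_{m'}(o_t \mid \underline{ao}_{<t} a_t) \, P(m' \mid \underline{\hat{a}o}_{<t})}. \qquad (3)$$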

Equations (2) and (3) constitute the Bayesian control
rule. This result is obtained by applying properties of
interventions from causal calculus (Pearl, 2000). It is
worth pointing out that the resulting controller is fully
defined in terms of its constituent controllers in $\mathcal{P}$. It
is customary to use the notation

$$P(a_t \mid m, \underline{ao}_{<t}) := P_m(a_t \mid \underline{ao}_{<t})$$
$$P(o_t \mid m, \underline{ao}_{<t} a_t) := P_m(o_t \mid \underline{ao}_{<t} a_t),$$

that is, treating the different controllers as "hypotheses"
of a Bayesian model. In the context of the
Bayesian control rule, these "I/O hypotheses" are
called operation modes. Note that the resulting control
law is in general stochastic.
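
As a minimal sketch of Equations (2) and (3) for a finite set of operation modes, the following Python snippet draws an action from the mixture over policies and then updates the posterior using only the observation likelihoods. The history-independent (stationary) models, the array names `policy` and `obs_model`, the function names and the numerical values are illustrative assumptions and are not taken from the original formulation.

```python
import numpy as np

def bcr_action_probs(policy, posterior):
    """Mixture over actions, Eq. (2): sum_m P(a | m) P(m | history)."""
    return posterior @ policy  # shape (n_actions,)

def bcr_posterior_update(obs_model, posterior, action, obs):
    """Posterior update, Eq. (3): only observation likelihoods enter."""
    likelihood = obs_model[:, action, obs]  # P(o | m, a) for every mode m
    unnorm = likelihood * posterior
    return unnorm / unnorm.sum()

# Hypothetical example: 2 operation modes, 2 actions, 2 observations.
policy = np.array([[0.9, 0.1],                     # P(a | m=0)
                   [0.1, 0.9]])                    # P(a | m=1)
obs_model = np.array([[[0.8, 0.2], [0.5, 0.5]],    # P(o | m=0, a)
                      [[0.3, 0.7], [0.5, 0.5]]])   # P(o | m=1, a)

rng = np.random.default_rng(0)
true_mode = 0                                      # plant behaves like mode 0
posterior = np.array([0.5, 0.5])                   # prior P(m)
for t in range(200):
    a = rng.choice(2, p=bcr_action_probs(policy, posterior))
    o = rng.choice(2, p=obs_model[true_mode, a])   # plant emits observation
    posterior = bcr_posterior_update(obs_model, posterior, a, o)
```

Drawing an action from the mixture is equivalent to first sampling an operation mode from the posterior and then sampling an action from that mode's policy, which is why the resulting control law is stochastic.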

4. Policy Diagrams 

A policy diagram is a useful informal tool to analyze
the effect of control policies on plants. Figure 1
illustrates an example. One can imagine a plant as a
collection of states connected by transitions labeled by
I/O symbols. For instance, Figure 1 highlights a state
$s$ where taking action $a \in \mathcal{A}$ and collecting observation
$o \in \mathcal{O}$ leads to state $s'$. In a policy diagram,
one abstracts away from the underlying details of the
plant's dynamics, representing sets of states and transitions
as enclosed areas similar to a Venn diagram.
Choosing a particular policy in a plant amounts to
partially controlling the transitions taken in the state
space, thereby choosing a subset of the plant's dynamics.
Accordingly, a policy is represented by a subset in
state space (enclosed by a directed curve) as illustrated
in Figure 1.

Figure 1. A policy diagram.

Policy diagrams are especially useful to analyze the
effect of policies on different hypotheses about the
plant's dynamics. A controller that is endowed with
a set of operation modes $\mathcal{M}$ can be seen as having
hypotheses about the plant's underlying dynamics,
given by the observation models $P(o_t \mid m, \underline{ao}_{<t} a_t)$,
and associated policies, given by the action models
$P(a_t \mid m, \underline{ao}_{<t})$, for all $m \in \mathcal{M}$. For the sake of
simplifying the interpretation of policy diagrams, we will
assume$^1$ the existence of a state space $\mathcal{S}$ and a function
$T : (\mathcal{A} \times \mathcal{O})^\ast \to \mathcal{S}$ mapping I/O histories into states.
With this assumption, policies and hypotheses can be
seen as conditional probabilities

$$P(a_t \mid m, s) := P(a_t \mid m, \underline{ao}_{<t}) \quad \text{and} \quad P(o_t \mid m, s, a_t) := P(o_t \mid m, \underline{ao}_{<t} a_t)$$

respectively, defining transition probabilities

$$P(s' \mid m, s) = \sum_{S'} P(\underline{ao}_t \mid m, s)$$

for a Markov chain in the state space, where $s = T(\underline{ao}_{<t})$
and $S'$ contains the transitions $\underline{ao}_t$ such that
$T(\underline{ao}_{\le t}) = s'$.
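
The transition probabilities of the induced Markov chain can be sketched in a few lines of Python for a single operation mode. The sketch assumes, beyond what the text states, that the state map $T$ can be updated incrementally, i.e. that the next state depends only on the current state and the latest I/O pair; the table `next_state` and all numbers are hypothetical.

```python
import numpy as np

def state_transition_matrix(policy, obs_model, next_state):
    """P(s' | m, s): sum of P(a | m, s) P(o | m, s, a) over pairs (a, o) leading to s'."""
    n_states, n_actions = policy.shape
    n_obs = obs_model.shape[2]
    trans = np.zeros((n_states, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for o in range(n_obs):
                trans[s, next_state[s, a, o]] += policy[s, a] * obs_model[s, a, o]
    return trans

# Hypothetical mode with 2 states, 2 actions and 2 observations.
policy = np.array([[1.0, 0.0],            # P(a | m, s=0)
                   [0.5, 0.5]])           # P(a | m, s=1)
obs_model = np.full((2, 2, 2), 0.5)       # P(o | m, s, a), uniform here
next_state = np.array([[[0, 1], [1, 1]],  # successor of s=0 for each (a, o)
                       [[0, 0], [1, 0]]]) # successor of s=1 for each (a, o)

trans = state_transition_matrix(policy, obs_model, next_state)
```

Each row of `trans` is a probability distribution over successor states, so a policy diagram can be read as the support of this matrix.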

5. Divergence Processes 

One of the obvious questions to ask oneself with respect
to the Bayesian control rule is whether it converges
to the right control law or not. That is, whether
$P(a_t \mid \underline{\hat{a}o}_{<t}) \to P(a_t \mid m^\ast, \underline{ao}_{<t})$ as $t \to \infty$ when $m^\ast$ is the
true operation mode, i.e. the operation mode such
that $P(o_t \mid m^\ast, \underline{ao}_{<t} a_t) = Q(o_t \mid \underline{ao}_{<t} a_t)$. As will be obvious
from the discussion in the rest of this paper, this is in
general not true.

As is easily seen from Equation (2), showing convergence
amounts to showing that the posterior distribution
$P(m \mid \underline{\hat{a}o}_{<t})$ concentrates its probability mass on a subset
of operation modes $\mathcal{M}^\ast$ having essentially the same
output stream as $m^\ast$,

$$\sum_{m \in \mathcal{M}} P(a_t \mid m, \underline{ao}_{<t}) \, P(m \mid \underline{\hat{a}o}_{<t}) \approx \sum_{m \in \mathcal{M}^\ast} P(a_t \mid m^\ast, \underline{ao}_{<t}) \, P(m \mid \underline{\hat{a}o}_{<t}) \approx P(a_t \mid m^\ast, \underline{ao}_{<t}).$$

Figure 2. Realization of the divergence processes 1 to 4
associated to a controller with operation modes $m_1$ to $m_4$.
The divergence processes 1 and 2 diverge, whereas 3 and 4
stay below the dotted bound. Hence, the posterior
probabilities of $m_1$ and $m_2$ vanish.

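Hence, understanding the asymptotic behavior of the
posterior probabilities

$$P(m \mid \underline{\hat{a}o}_{\le t})$$

is the main goal of this paper. In particular, one wants
to understand under what conditions these quantities
converge to zero. The posterior can be rewritten as

$$P(m \mid \underline{\hat{a}o}_{\le t}) = \frac{P(\underline{\hat{a}o}_{\le t} \mid m) \, P(m)}{\sum_{m' \in \mathcal{M}} P(\underline{\hat{a}o}_{\le t} \mid m') \, P(m')} = \frac{P(m) \prod_{\tau=1}^{t} P(o_\tau \mid m, \underline{ao}_{<\tau} a_\tau)}{\sum_{m' \in \mathcal{M}} P(m') \prod_{\tau=1}^{t} P(o_\tau \mid m', \underline{ao}_{<\tau} a_\tau)}.$$

$^1$ Note however that no such assumptions are made to
obtain the results of this paper.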



If all the summands but the one with index $m^\ast$ are
dropped from the denominator, one obtains the bound

$$P(m \mid \underline{\hat{a}o}_{\le t}) \le \frac{P(m)}{P(m^\ast)} \prod_{\tau=1}^{t} \frac{P(o_\tau \mid m, \underline{ao}_{<\tau} a_\tau)}{P(o_\tau \mid m^\ast, \underline{ao}_{<\tau} a_\tau)},$$

which is valid for all $m^\ast \in \mathcal{M}$. From this inequality,
it is seen that it is convenient to analyze the behavior
of the stochastic process

$$d_t(m^\ast \| m) := \sum_{\tau=1}^{t} \ln \frac{P(o_\tau \mid m^\ast, \underline{ao}_{<\tau} a_\tau)}{P(o_\tau \mid m, \underline{ao}_{<\tau} a_\tau)}$$

which is the divergence process of $m$ from the reference
$m^\ast$. Indeed, if $d_t(m^\ast \| m) \to \infty$ as $t \to \infty$, then

$$\lim_{t \to \infty} \frac{P(m)}{P(m^\ast)} \prod_{\tau=1}^{t} \frac{P(o_\tau \mid m, \underline{ao}_{<\tau} a_\tau)}{P(o_\tau \mid m^\ast, \underline{ao}_{<\tau} a_\tau)} = \lim_{t \to \infty} \frac{P(m)}{P(m^\ast)} \, e^{-d_t(m^\ast \| m)} = 0,$$

and thus clearly $P(m \mid \underline{\hat{a}o}_{\le t}) \to 0$. Figure 2 illustrates
simultaneous realizations of the divergence processes
of a controller. Intuitively speaking, these processes
provide lower bounds on accumulators of surprise value
measured in information units.
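
The relation between a divergence process and the posterior bound can be checked numerically. The sketch below computes $d_t(m^\ast \| m)$ as a running sum of log-likelihood ratios and compares the resulting upper bound with the exact posterior of a two-mode controller; the per-step likelihood values and the function names are hypothetical and serve only as an illustration.

```python
import numpy as np

def divergence_process(lik_true, lik_other):
    """d_t(m*||m) for t = 1..T: cumulative sum of ln P(o|m*,.) / P(o|m,.)."""
    return np.cumsum(np.log(lik_true) - np.log(lik_other))

def posterior_upper_bound(prior_other, prior_true, d):
    """Upper bound P(m)/P(m*) * exp(-d_t(m*||m)) on the posterior of m."""
    return (prior_other / prior_true) * np.exp(-d)

# Hypothetical likelihoods of the realized observations at each time step.
lik_true = np.array([0.8, 0.8, 0.2, 0.8])    # under the reference mode m*
lik_other = np.array([0.3, 0.3, 0.7, 0.3])   # under the competing mode m

d = divergence_process(lik_true, lik_other)
bound = posterior_upper_bound(0.5, 0.5, d)

# Exact posterior of m when m* and m are the only two modes.
joint_true = 0.5 * np.cumprod(lik_true)
joint_other = 0.5 * np.cumprod(lik_other)
posterior_other = joint_other / (joint_true + joint_other)
```

Whenever `d` grows, `bound` shrinks exponentially and drags `posterior_other` down with it, which is the mechanism used in the remainder of the analysis.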





Figure 3. The application of different policies leads to
different statistical properties of the same divergence process.

A divergence process is a random walk whose
value at time $t$ depends on the whole history up to time
$t - 1$. What makes divergence processes cumbersome to
characterize is the fact that their statistical properties depend on
the particular policy that is applied; hence, a given
divergence process can have different growth rates
depending on the policy (Figure 3). Indeed, the behavior
of a divergence process might depend critically on the
distribution over actions that is used. For example,
it can happen that a divergence process stays stable
under one policy, but diverges under another. In the
context of the Bayesian control rule this problem is
further aggravated, because in each time step, the policy
to apply is determined stochastically. More specifically,
if $m^\ast$ is the true operation mode, then $d_t(m^\ast \| m)$
is a random variable that depends on the realization
$\underline{ao}_{\le t}$ which is drawn from

$$\prod_{\tau=1}^{t} P(a_\tau \mid m_\tau, \underline{ao}_{<\tau}) \, P(o_\tau \mid m^\ast, \underline{ao}_{<\tau} a_\tau),$$

where the $m_1, m_2, \ldots, m_t$ are drawn themselves from
$P(m_1), P(m_2 \mid \underline{\hat{a}o}_{1}), \ldots, P(m_t \mid \underline{\hat{a}o}_{<t})$.

To deal with the heterogeneous nature of divergence
processes, one can introduce a temporal decomposition
that demultiplexes the original process into many
sub-processes belonging to unique policies. Let
$\mathcal{N}_t := \{1, 2, \ldots, t\}$ be the set of time steps up to time $t$. Let
$\mathcal{T} \subset \mathcal{N}_t$, and let $m, m' \in \mathcal{M}$. Define a sub-divergence
of $d_t(m^\ast \| m)$ as a random variable

$$g(m'; \mathcal{T}) := \sum_{\tau \in \mathcal{T}} \ln \frac{P(o_\tau \mid m^\ast, \underline{ao}_{<\tau} a_\tau)}{P(o_\tau \mid m, \underline{ao}_{<\tau} a_\tau)}$$

drawn from

$$P_{m'}\bigl(\{\underline{ao}_\tau\}_{\tau \in \mathcal{T}} \mid \{\underline{ao}_\tau\}_{\tau \in \mathcal{T}^c}\bigr) := \Bigl( \prod_{\tau \in \mathcal{T}} P(a_\tau \mid m', \underline{ao}_{<\tau}) \Bigr) \Bigl( \prod_{\tau \in \mathcal{T}} P(o_\tau \mid m^\ast, \underline{ao}_{<\tau} a_\tau) \Bigr),$$

where $\mathcal{T}^c := \mathcal{N}_t \setminus \mathcal{T}$ and where $\{\underline{ao}_\tau\}_{\tau \in \mathcal{T}^c}$ are given
conditions that are kept constant. In this definition,
$m'$ plays the role of the policy that is used to sample
the actions in the time steps $\mathcal{T}$. Clearly, any realization
of the divergence process $d_t(m^\ast \| m)$ can be
decomposed into a sum of sub-divergences, i.e.

$$d_t(m^\ast \| m) = \sum_{m'} g(m'; \mathcal{T}_{m'}), \qquad (4)$$

where $\{\mathcal{T}_m\}_{m \in \mathcal{M}}$ forms a partition of $\mathcal{N}_t$. Figure 4
shows an example decomposition.

Figure 4. Decomposition of a divergence process (1) into
sub-divergences (2 & 3).
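
The temporal decomposition into sub-divergences can be sketched as follows. Given the per-step log-likelihood ratios of one realization and a record of which operation mode supplied the policy at each time step, the snippet groups the steps into the sets $\mathcal{T}_{m'}$ and sums within each group. The arrays `log_ratios` and `modes` and their values are hypothetical.

```python
import numpy as np

def sub_divergences(log_ratios, modes, n_modes):
    """Return g(m'; T_m') for every m', with T_m' = {tau : modes[tau] == m'}."""
    return np.array([log_ratios[modes == m].sum() for m in range(n_modes)])

# Hypothetical realization over six time steps.
log_ratios = np.array([0.9, -0.2, 0.4, 0.0, 1.1, -0.3])  # ln P(o|m*,.) / P(o|m,.)
modes = np.array([0, 1, 0, 2, 1, 0])                     # active policy per step

g = sub_divergences(log_ratios, modes, n_modes=3)
d_t = log_ratios.sum()                                   # divergence process at time t
```

Because the sets $\mathcal{T}_{m'}$ partition the time steps, `g.sum()` recovers `d_t`, which is the content of the decomposition into sub-divergences.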

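The averages of sub-divergences will play an important
role in the analysis. Define the average over all
realizations of $g(m'; \mathcal{T})$ as

$$G(m'; \mathcal{T}) := \sum_{\{\underline{ao}_\tau\}_{\tau \in \mathcal{T}}} P_{m'}\bigl(\{\underline{ao}_\tau\}_{\tau \in \mathcal{T}} \mid \{\underline{ao}_\tau\}_{\tau \in \mathcal{T}^c}\bigr) \, g(m'; \mathcal{T}).$$

Notice that for any $\tau \in \mathcal{N}_t$,

$$G(m'; \{\tau\}) = \sum_{\underline{ao}_\tau} P(a_\tau \mid m', \underline{ao}_{<\tau}) \, P(o_\tau \mid m^\ast, \underline{ao}_{<\tau} a_\tau) \ln \frac{P(o_\tau \mid m^\ast, \underline{ao}_{<\tau} a_\tau)}{P(o_\tau \mid m, \underline{ao}_{<\tau} a_\tau)} \ge 0$$

because of Gibbs' inequality. In particular,

$$G(m^\ast; \{\tau\}) = 0.$$

Clearly, this holds as well for any $\mathcal{T} \subset \mathcal{N}_t$:

$$\forall m' \quad G(m'; \mathcal{T}) \ge 0, \qquad G(m^\ast; \mathcal{T}) = 0. \qquad (5)$$

6. Boundedness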

In general, a divergence process is very complex: virtually
all the classes of distributions that are of interest in
control go well beyond i.i.d. and stationary processes.
This increased complexity can jeopardize the analytic
tractability of the divergence process, to the point that
no predictions about its asymptotic behavior can be
made anymore. More specifically, if the growth rates
of the divergence processes vary too much from realization
to realization, then the posterior distribution
over operation modes can vary qualitatively between
realizations. Hence, one needs to impose a stability
requirement akin to ergodicity to limit the class of possible
divergence processes to a class that is analytically
tractable. In the light of this insight, the following
property is introduced.

A divergence process $d_t(m^\ast \| m)$ is said to be bounded
in $\mathcal{M}$ iff for any $\delta > 0$, there is a $C \ge 0$, such that for
all $m' \in \mathcal{M}$, all $t$ and all $\mathcal{T} \subset \mathcal{N}_t$

$$\bigl| g(m'; \mathcal{T}) - G(m'; \mathcal{T}) \bigr| \le C$$

with probability $\ge 1 - \delta$.



Figure 5 illustrates this property. Boundedness is the 
key property that is going to be used to construct the 
results of this paper. The first important result is that 
the posterior probability of the true operation mode is 
bounded from below. 

Theorem 1. Let the set of operation modes of a controller
be such that for all $m \in \mathcal{M}$ the divergence process
$d_t(m^\ast \| m)$ is bounded. Then, for any $\delta > 0$, there
is a $\lambda > 0$, such that for all $t \in \mathbb{N}$,

$$P(m^\ast \mid \underline{\hat{a}o}_{\le t}) \ge \frac{\lambda}{|\mathcal{M}|}$$

with probability $\ge 1 - \delta$.

Figure 5. If a divergence process is bounded, then the
realizations (curves 2 & 3) of a sub-divergence stay within a
band around the mean (curve 1).



Proof. As has been pointed out in (4), a particular
realization of the divergence process $d_t(m^\ast \| m)$ can be
decomposed as

$$d_t(m^\ast \| m) = \sum_{m'} g_m(m'; \mathcal{T}_{m'}),$$

where the $g_m(m'; \mathcal{T}_{m'})$ are sub-divergences of
$d_t(m^\ast \| m)$ and the $\mathcal{T}_{m'}$ form a partition of $\mathcal{N}_t$. However,
since $d_t(m^\ast \| m)$ is bounded in $\mathcal{M}$, one has for all
$\delta' > 0$, there is a $C(m) \ge 0$, such that for all $m' \in \mathcal{M}$,
all $t \in \mathbb{N}$ and all $\mathcal{T} \subset \mathcal{N}_t$, the inequality

$$\bigl| g_m(m'; \mathcal{T}_{m'}) - G_m(m'; \mathcal{T}_{m'}) \bigr| \le C(m)$$

holds with probability $\ge 1 - \delta'$. However, due to (5),

$$G_m(m'; \mathcal{T}_{m'}) \ge 0$$

for all $m' \in \mathcal{M}$. Thus,

$$g_m(m'; \mathcal{T}_{m'}) \ge -C(m).$$

If all the previous inequalities hold simultaneously
then the divergence process can be bounded as well.
That is, the inequality

$$d_t(m^\ast \| m) \ge -M C(m) \qquad (6)$$

holds with probability $\ge (1 - \delta')^M$ where $M := |\mathcal{M}|$.
Choose

$$\beta(m) := \max\Bigl\{0, \ln \frac{P(m)}{P(m^\ast)}\Bigr\}.$$

Since $0 \ge \ln \frac{P(m)}{P(m^\ast)} - \beta(m)$, it can be added to the right
hand side of (6). Using the definition of $d_t(m^\ast \| m)$,
taking the exponential and rearranging the terms one
obtains

$$P(m^\ast) \prod_{\tau=1}^{t} P(o_\tau \mid m^\ast, \underline{ao}_{<\tau} a_\tau) \ge e^{-\alpha(m)} P(m) \prod_{\tau=1}^{t} P(o_\tau \mid m, \underline{ao}_{<\tau} a_\tau)$$

where $\alpha(m) := M C(m) + \beta(m) \ge 0$. Identifying the
posterior probabilities of $m^\ast$ and $m$ by dividing both
sides by the normalizing constant yields the inequality

$$P(m^\ast \mid \underline{\hat{a}o}_{\le t}) \ge e^{-\alpha(m)} P(m \mid \underline{\hat{a}o}_{\le t}).$$

This inequality holds simultaneously for all $m \in \mathcal{M}$
with probability $\ge (1 - \delta')^{M^2}$ and in particular for
$\lambda := \min_m \{e^{-\alpha(m)}\}$, that is,

$$P(m^\ast \mid \underline{\hat{a}o}_{\le t}) \ge \lambda P(m \mid \underline{\hat{a}o}_{\le t}).$$

But since this is valid for any $m \in \mathcal{M}$, and because
$\max_m \{P(m \mid \underline{\hat{a}o}_{\le t})\} \ge \frac{1}{M}$, one gets

$$P(m^\ast \mid \underline{\hat{a}o}_{\le t}) \ge \frac{\lambda}{M},$$

with probability $\ge 1 - \delta$ for arbitrary $\delta > 0$ related to
$\delta'$ through the equation $\delta' := 1 - \sqrt[M^2]{1 - \delta}$. □
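
The step from inequality (6) to the product inequality is stated tersely in the proof. One way to spell it out, using only the quantities $C(m)$, $\beta(m)$ and $\alpha(m)$ already defined there, is sketched below; it is a reading of the argument rather than additional material.

```latex
\begin{align*}
d_t(m^\ast \| m)
  &\ge -M C(m)
   \ge -M C(m) - \beta(m) + \ln \frac{P(m)}{P(m^\ast)}
   = -\alpha(m) + \ln \frac{P(m)}{P(m^\ast)} \\
\Longrightarrow \quad
\prod_{\tau=1}^{t}
  \frac{P(o_\tau \mid m^\ast, \underline{ao}_{<\tau} a_\tau)}
       {P(o_\tau \mid m, \underline{ao}_{<\tau} a_\tau)}
  &= e^{d_t(m^\ast \| m)}
   \ge e^{-\alpha(m)} \, \frac{P(m)}{P(m^\ast)} .
\end{align*}
```

Multiplying both sides of the last line by $P(m^\ast) \prod_{\tau=1}^{t} P(o_\tau \mid m, \underline{ao}_{<\tau} a_\tau)$ gives the product inequality used in the proof.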

7. Core 

If one wants to identify the operation modes whose
posterior probabilities vanish, then it is not enough
to characterize them as those whose hypothesis does
not match the true hypothesis. Figure 6 illustrates
this problem. Here, three hypotheses along with their
associated policies are shown. $H_1$ and $H_2$ share the
prediction made for region $A$ but differ in region $B$.
Hypothesis $H_3$ differs everywhere from the others.
Assume $H_1$ is true. As long as we apply policy $P_2$,
hypothesis $H_3$ will make wrong predictions and thus its
divergence process will diverge as expected. However,
no evidence against $H_2$ will be accumulated. It is only
when we apply policy $P_1$ for a long enough time that
the controller will eventually enter region $B$ and hence
accumulate counter-evidence for $H_2$.

Figure 6. If hypothesis $H_1$ is true and agrees with $H_2$ on
region $A$, then policy $P_2$ cannot disambiguate the three
hypotheses.

But what does "long enough" mean? If $P_1$ is executed
only for a short period, then the controller risks not
visiting the disambiguating region. But unfortunately,
neither the right policy nor the right length of the
period to run it are known beforehand. Hence, the
controller needs a clever time-allocating strategy to test
all policies for all finite time intervals. This motivates
the following definition.

The core of an operation mode $m^\ast$, denoted as $[m^\ast]$, is
the subset of $\mathcal{M}$ containing operation modes behaving
like $m^\ast$ under its policy. More formally, an operation
mode $m \notin [m^\ast]$ (i.e. is not in the core) iff for any
$C \ge 0$, $\delta, \xi > 0$, there is a $t_0 \in \mathbb{N}$, such that for all
$t \ge t_0$,

$$G(m^\ast; \mathcal{T}) \ge C$$

with probability $\ge 1 - \delta$, where $G(m^\ast; \mathcal{T})$ is a
sub-divergence of $d_t(m^\ast \| m)$, and $\Pr\{\tau \in \mathcal{T}\} \ge \xi$ for all
$\tau \in \mathcal{N}_t$.

In other words, if the controller was to apply $m^\ast$'s
policy in each time step with probability at least $\xi$,
and under this strategy the expected sub-divergence
$G(m^\ast; \mathcal{T})$ of $d_t(m^\ast \| m)$ grows unboundedly, then $m$ is
not in the core of $m^\ast$. Note that demanding a strictly
positive probability of execution in each time step
guarantees that the controller will run $m^\ast$ for all possible
finite time-intervals. As the following theorem shows,
the posterior probabilities of the operation modes that
are not in the core vanish almost surely.

Theorem 2. Let the set of operation modes of a controller
be such that for all $m \in \mathcal{M}$ the divergence process
$d_t(m^\ast \| m)$ is bounded. Then, if $m \notin [m^\ast]$, then
$P(m \mid \underline{\hat{a}o}_{\le t}) \to 0$ as $t \to \infty$ almost surely.

Proof. The divergence process $d_t(m^\ast \| m)$ can be
decomposed into a sum of sub-divergences (see Equation 4)

$$d_t(m^\ast \| m) = \sum_{m'} g(m'; \mathcal{T}_{m'}). \qquad (7)$$

Furthermore, for every $m' \in \mathcal{M}$, one has that for all
$\delta' > 0$, there is a $C(m) \ge 0$, such that for all $t \in \mathbb{N}$ and for
all $\mathcal{T} \subset \mathcal{N}_t$

$$\bigl| g(m'; \mathcal{T}) - G(m'; \mathcal{T}) \bigr| \le C(m)$$

with probability $\ge 1 - \delta'$. Applying this bound to the
summands in (7) yields the lower bound

$$\sum_{m'} g(m'; \mathcal{T}_{m'}) \ge \sum_{m'} \bigl( G(m'; \mathcal{T}_{m'}) - C(m) \bigr)$$

which holds with probability $\ge (1 - \delta')^M$, where $M := |\mathcal{M}|$.
Due to Inequality 5, one has that for all $m' \ne m^\ast$,
$G(m'; \mathcal{T}_{m'}) \ge 0$. Hence,

$$\sum_{m'} \bigl( G(m'; \mathcal{T}_{m'}) - C(m) \bigr) \ge G(m^\ast; \mathcal{T}_{m^\ast}) - M C$$

where $C := \max_m \{C(m)\}$. The members of the set
$\mathcal{T}_{m^\ast}$ are determined stochastically; more specifically,
the $i$-th member is included into $\mathcal{T}_{m^\ast}$ with probability
$P(m^\ast \mid \underline{\hat{a}o}_{<i})$. But since $m \notin [m^\ast]$, one has that
$G(m^\ast; \mathcal{T}_{m^\ast}) \to \infty$ as $t \to \infty$ with probability $\ge 1 - \delta'$
for arbitrarily chosen $\delta' > 0$. This implies that

$$\lim_{t \to \infty} d_t(m^\ast \| m) \ge \lim_{t \to \infty} G(m^\ast; \mathcal{T}_{m^\ast}) - M C \nearrow \infty$$

with probability $\ge 1 - \delta$, where $\delta > 0$ is arbitrary and
related to $\delta'$ as $\delta = 1 - (1 - \delta')^{M+1}$. Using this result
in the upper bound for posterior probabilities yields
the final result

$$0 \le \lim_{t \to \infty} P(m \mid \underline{\hat{a}o}_{\le t}) \le \lim_{t \to \infty} \frac{P(m)}{P(m^\ast)} \, e^{-d_t(m^\ast \| m)} = 0.$$

□



8. Consistency 

Even if an operation mode $m$ is in the core of $m^\ast$, i.e.
given that $m$ is essentially indistinguishable from $m^\ast$
under $m^\ast$'s control, it can still happen that $m^\ast$ and
$m$ have different policies. Figure 7 shows an example
of this. The hypotheses $H_1$ and $H_2$ share region $A$
but differ in region $B$. In addition, both operation
modes have their policies $P_1$ and $P_2$ respectively
confined to region $A$. Note that both operation modes are
in the core of each other. However, their policies are
different. This means that it is unclear whether
multiplexing the policies in time will ever disambiguate
the two hypotheses. This is undesirable, as it could
impede the convergence to the right control law.

Figure 7. An example of inconsistent policies. Both operation
modes are in the core of each other, but have different
policies.

Thus, it is clear that one needs to impose further
restrictions on the mapping of hypotheses into policies.
With respect to Figure 7, one can make the following
observations:

1. Both operation modes have policies that select
subsets of region $A$. Therefore, the dynamics in $A$
are preferred over the dynamics in $B$.

2. Knowing that the dynamics in $A$ are preferred
over the dynamics in $B$ allows one to drop region $B$
from the analysis when choosing a policy.

3. Since both hypotheses agree in region $A$, they have
to choose the same policy in order to be consistent
in their selection criterion.

This motivates the following definition. An operation
mode $m$ is said to be consistent with $m^\ast$ iff $m \in [m^\ast]$
implies that for all $\varepsilon > 0$, there is a $t_0$, such that for
all $t \ge t_0$ and all $\underline{ao}_{<t} a_t$,

$$\bigl| P(a_t \mid m, \underline{ao}_{<t}) - P(a_t \mid m^\ast, \underline{ao}_{<t}) \bigr| < \varepsilon.$$

In other words, if $m$ is in the core of $m^\ast$, then $m$'s
policy has to converge to $m^\ast$'s policy. Intuitively, this
property parallels the well-known sure-thing principle
of expected utility theory (Savage, 1954). The following
theorem shows that consistency is a sufficient condition
for convergence to the right control law.

Theorem 3. Let the set of operation modes of a controller
be such that: for all $m \in \mathcal{M}$ the divergence
process $d_t(m^\ast \| m)$ is bounded; and for all $m, m' \in \mathcal{M}$,
$m$ is consistent with $m'$. Then,

$$P(a_t \mid \underline{\hat{a}o}_{<t}) \to P(a_t \mid m^\ast, \underline{ao}_{<t})$$

almost surely as $t \to \infty$.

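Proof. We will use the abbreviations $p_m(t) := P(a_t \mid m, \underline{ao}_{<t})$
and $w_m(t) := P(m \mid \underline{\hat{a}o}_{<t})$. Decompose
$P(a_t \mid \underline{\hat{a}o}_{<t})$ as

$$P(a_t \mid \underline{\hat{a}o}_{<t}) = \sum_{m \notin [m^\ast]} p_m(t) w_m(t) + \sum_{m \in [m^\ast]} p_m(t) w_m(t). \qquad (8)$$

The first sum on the right-hand side is lower-bounded
by zero and upper-bounded by

$$\sum_{m \notin [m^\ast]} p_m(t) w_m(t) \le \sum_{m \notin [m^\ast]} w_m(t)$$

because $p_m(t) \le 1$. Due to Theorem 2, $w_m(t) \to 0$
as $t \to \infty$ almost surely. Given $\varepsilon' > 0$ and $\delta' > 0$,
let $t_0(m)$ be the time such that for all $t \ge t_0(m)$,
$w_m(t) < \varepsilon'/M$. Choosing $t_0 := \max_m \{t_0(m)\}$, the
previous inequality holds for all $m$ and $t \ge t_0$ simultaneously
with probability $\ge (1 - \delta')^M$. Hence,

$$\sum_{m \notin [m^\ast]} p_m(t) w_m(t) \le \sum_{m \notin [m^\ast]} w_m(t) < \varepsilon'. \qquad (9)$$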



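To bound the second sum in (8) one proceeds as follows.
For every member $m \in [m^\ast]$, one has that
$p_m(t) \to p_{m^\ast}(t)$ as $t \to \infty$. Hence, following a similar
construction as above, one can choose $t_0'$ such that
for all $t \ge t_0'$ and $m \in [m^\ast]$, the inequalities

$$\bigl| p_m(t) - p_{m^\ast}(t) \bigr| < \varepsilon'$$

hold simultaneously for the precision $\varepsilon' > 0$. Applying
this to the second sum yields the bounds

$$\sum_{m \in [m^\ast]} \bigl( p_{m^\ast}(t) - \varepsilon' \bigr) w_m(t) \le \sum_{m \in [m^\ast]} p_m(t) w_m(t) \le \sum_{m \in [m^\ast]} \bigl( p_{m^\ast}(t) + \varepsilon' \bigr) w_m(t).$$

Here $(p_{m^\ast}(t) \pm \varepsilon')$ are multiplicative constants that
can be placed in front of the sum. Note that

$$1 \ge \sum_{m \in [m^\ast]} w_m(t) = 1 - \sum_{m \notin [m^\ast]} w_m(t) > 1 - \varepsilon'.$$

Using the above inequalities allows simplifying
the lower and upper bounds respectively:

$$\bigl( p_{m^\ast}(t) - \varepsilon' \bigr) \sum_{m \in [m^\ast]} w_m(t) > p_{m^\ast}(t)(1 - \varepsilon') - \varepsilon' \ge p_{m^\ast}(t) - 2\varepsilon',$$
$$\bigl( p_{m^\ast}(t) + \varepsilon' \bigr) \sum_{m \in [m^\ast]} w_m(t) \le p_{m^\ast}(t) + \varepsilon' < p_{m^\ast}(t) + 2\varepsilon'. \qquad (10)$$

Combining the inequalities (9) and (10) in (8) yields
the final result:

$$\bigl| P(a_t \mid \underline{\hat{a}o}_{<t}) - p_{m^\ast}(t) \bigr| < 3\varepsilon' = \varepsilon,$$

which holds with probability $\ge 1 - \delta$ for arbitrary $\delta > 0$
related to $\delta'$ as $\delta' = 1 - \sqrt[M]{1 - \delta}$ and arbitrary
precision $\varepsilon$. □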
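
The way the two sums in (8) control the distance between the mixture and the true policy can be illustrated numerically. The sketch below uses a plain triangle-inequality bound (posterior mass outside the core plus the worst policy mismatch inside the core), which is a simplified stand-in for inequalities (9) and (10) rather than a reproduction of them; the snapshot values `p`, `w` and the membership list `core` are hypothetical.

```python
import numpy as np

def mixture_gap(p, w, true_mode, action):
    """|P(a | history) - p_{m*}| for the mixture P(a | history) = sum_m p_m w_m."""
    return abs(w @ p[:, action] - p[true_mode, action])

def two_term_bound(p, w, core, true_mode, action):
    """Posterior mass outside the core plus worst policy mismatch inside it."""
    outside = np.ones(len(w), dtype=bool)
    outside[core] = False
    mismatch = np.abs(p[core, action] - p[true_mode, action]).max()
    return w[outside].sum() + mismatch

# Hypothetical snapshot at some time t: 3 operation modes, 2 actions.
p = np.array([[0.70, 0.30],    # true mode m* = 0
              [0.68, 0.32],    # in the core, policy close to that of m*
              [0.10, 0.90]])   # outside the core
w = np.array([0.60, 0.38, 0.02])   # posterior weights w_m(t)
core = [0, 1]

gap = mixture_gap(p, w, true_mode=0, action=0)
bound = two_term_bound(p, w, core, true_mode=0, action=0)
```

As the posterior mass outside the core vanishes (Theorem 2) and the policies inside the core approach that of $m^\ast$ (consistency), both terms of `bound` shrink, and so does `gap`.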



9. Summary and Conclusions 

The Bayesian control rule constitutes a promising ap- 
proach to adaptive control based on the minimization 
of the relative entropy of the causal I/O distribution of 
a mixture controller from the true controller. In this 
work, a proof of convergence of the Bayesian control 
rule to the true controller is provided. 

Analyzing the asymptotic behavior of a controller- 
plant dynamics could be perceived as a difficult prob- 
lem that involves the consideration of domain-specific 
assumptions. Here it is shown that this is not the 
case: the asymptotic analysis can be recast as the 
study of concurrent divergence processes that deter- 
mine the evolution of the posterior probabilities over 
operation modes, thus abstracting away from the de- 
tails of the classes of I/O distributions. In particular, 



if the set of operation modes is finite, then two extra 
assumptions are sufficient to prove convergence. The 
first one, boundedness, imposes the stability of diver- 
gence processes under the partial influence of the poli- 
cies contained within the set of operation modes. This 
condition can be regarded as an ergodicity assump- 
tion. The second one, consistency, requires that if a 
hypothesis makes the same predictions as another hy- 
pothesis within its most relevant subset of dynamics, 
then both hypotheses share the same policy. This rel- 
evance is formalized as the core of an operation mode. 

The concepts and proof strategies developed in this
work are appealing due to their intuitive interpretation
and formal simplicity. Most importantly, they
strengthen the intuition about potential pitfalls that
arise in the context of controller design. The approach
presented in this work can also be considered as a
guide for possible extensions to infinite sets of operation
modes. For example, one can think of partitioning
a continuous space of operation modes into "essentially
different" regions where representative operation
modes subsume their neighborhoods (Grünwald,
2007).

Finally, convergence proofs play a crucial role in the
mathematical justification of any new theory of control.
Hopefully, this proof will contribute to establishing
relative entropy control theories as solid alternative
formulations to the problem of adaptive control.



References 

Duff, M.O. Optimal learning: computational procedures for
Bayes-adaptive Markov decision processes. PhD thesis,
2002. Director-Andrew Barto.

Grünwald, P. The Minimum Description Length Principle.
The MIT Press, 2007.

Kappen, B., Gomez, V., and Opper, M. Optimal control as
a graphical model inference problem. arXiv:0901.0633,
2009.

Ortega, P.A. and Braun, D.A. A Bayesian rule for adaptive
control based on causal interventions. In Proceedings
of the third conference on general artificial intelligence,
2010.

Pearl, J. Causality: Models, Reasoning, and Inference.
Cambridge University Press, Cambridge, UK, 2000.

Savage, L.J. The Foundations of Statistics. John Wiley
and Sons, New York, 1954. ISBN 0-486-62349-1.

Todorov, E. Linearly solvable Markov decision problems.
In Advances in Neural Information Processing Systems,
volume 19, pp. 1369-1376, 2006.

Todorov, E. Efficient computation of optimal actions.
Proceedings of the National Academy of Sciences U.S.A.,
106:11478-11483, 2009.



